There has been great recent advancement in human-computer chat. However, proper evaluation currently requires human judgements that produce notoriously high-variance metrics due to their inherent subjectivity. Furthermore, there is little standardization in the methods and labels used for evaluation, with an overall lack of work to compare and assess the validity of various evaluation approaches. As a consequence, existing evaluation results likely leave an incomplete picture of the strengths and weaknesses of open-domain chatbots. We aim towards a dimensional evaluation of human-computer chat that can reliably measure several distinct aspects of chat quality. To this end, we present our novel human evaluation method that quantifies the rate of several quality-related chatbot behaviors. Our results demonstrate our method to be more suitable for dimensional chat evaluation than alternative likert-style or comparative methods. We then use our validated method and existing methods to evaluate four open-domain chat models from the recent literature.
translated by 谷歌翻译
数据增强是自然语言处理(NLP)模型的鲁棒性评估的重要组成部分,以及增强他们培训的数据的多样性。在本文中,我们呈现NL-Cogmenter,这是一种新的参与式Python的自然语言增强框架,它支持创建两个转换(对数据的修改)和过滤器(根据特定功能的数据拆分)。我们描述了框架和初始的117个变换和23个过滤器,用于各种自然语言任务。我们通过使用其几个转换来分析流行自然语言模型的鲁棒性来证明NL-Upmenter的功效。基础架构,Datacards和稳健性分析结果在NL-Augmenter存储库上公开可用(\ url {https://github.com/gem-benchmark/nl-augmenter})。
translated by 谷歌翻译
我们通过纳入通用依赖性(UD)的句法特征来瞄准直接零射击设置中的跨语言机器阅读理解(MRC)的任务,以及我们使用的关键功能是每个句子中的语法关系。虽然以前的工作已经证明了有效的语法引导MRC模型,但我们建议采用句子际句法关系,除了基本的句子关系外,还可以进一步利用MRC任务的多句子输入中的句法依赖性。在我们的方法中,我们构建了句子间依赖图(ISDG)连接依赖树以形成横跨句子的全局句法关系。然后,我们提出了编码全局依赖关系图的ISDG编码器,通过明确地通过一个跳和多跳依赖性路径来解决句子间关系。三个多语言MRC数据集(XQUAD,MLQA,Tydiqa-Goldp)的实验表明,我们仅对英语培训的编码器能够在涵盖8种语言的所有14个测试集中提高零射性能,最高可达3.8 F1 / 5.2 EM平均改善,以及某些语言的5.2 F1 / 11.2 em。进一步的分析表明,改进可以归因于跨语言上一致的句法路径上的注意力。
translated by 谷歌翻译
提高对话系统的用户体验通常需要密集的开发人员努力读取对话日志,运行统计分析,并激活系统缺点的相对重要性。本文介绍了一种自动分析对话日志的新方法,了解用户系统交互与总体对话质量之间的关系。与在话语级别质量预测上的事先工作不同,我们的方法了解每个互动的影响,没有话语级注释的整体用户评级,允许基于经验证据和低成本获得所得模型结论。我们的模型识别与Chatbot设置中的与整体对话质量有着强烈相关的交互。实验表明,我们模型的自动分析同意专家判决,使这项工作首先表明这种弱监督的话语级质量预测学习是高度可取的。
translated by 谷歌翻译
我们展示了一个基于逻辑推理的新型对话管理方法的聊天栏。除了帧对话一系列响应生成任务,我们将对话作为协作推断过程,其中扬声器共享信息以实时地合成新知识。我们的Chatbot管道在三个广泛的阶段完成了这种建模。第一阶段将用户话语转换为符号谓词表示。然后,第二阶段与更大的知识库结合使用这种结构化表示来合成使用有效的图形匹配来扫描新谓词。在第三阶段和最后阶段,我们的机器人选择一个小的谓词子集并将它们转化为英语响应。这种方法为了解用户输入的潜在语义,灵活的主动措施以及与对话背景相干的响应。
translated by 谷歌翻译
Context is vital for commonsense moral reasoning. "Lying to a friend" is wrong if it is meant to deceive them, but may be morally okay if it is intended to protect them. Such nuanced but salient contextual information can potentially flip the moral judgment of an action. Thus, we present ClarifyDelphi, an interactive system that elicits missing contexts of a moral situation by generating clarification questions such as "Why did you lie to your friend?". Our approach is inspired by the observation that questions whose potential answers lead to diverging moral judgments are the most informative. We learn to generate questions using Reinforcement Learning, by maximizing the divergence between moral judgements of hypothetical answers to a question. Human evaluation shows that our system generates more relevant, informative and defeasible questions compared to other question generation baselines. ClarifyDelphi assists informed moral reasoning processes by seeking additional morally consequential context to disambiguate social and moral situations.
translated by 谷歌翻译
Pre-trained language models, despite their rapid advancements powered by scale, still fall short of robust commonsense capabilities. And yet, scale appears to be the winning recipe; after all, the largest models seem to have acquired the largest amount of commonsense capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can smaller language models with dismal commonsense capabilities (i.e., GPT-2), ever win over models that are orders of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commonsense distillation algorithms? The key intellectual question we ask here is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition. In this work, we study the generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce a novel commonsense distillation framework, I2D2, that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale models as the teacher model by two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model's own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-Tomic, that is of the largest and highest quality available to date.
translated by 谷歌翻译
Metaverse over wireless networks is an emerging use case of the sixth generation (6G) wireless systems, posing unprecedented challenges in terms of its multi-modal data transmissions with stringent latency and reliability requirements. Towards enabling this wireless metaverse, in this article we propose a novel semantic communication (SC) framework by decomposing the metaverse into human/machine agent-specific semantic multiverses (SMs). An SM stored at each agent comprises a semantic encoder and a generator, leveraging recent advances in generative artificial intelligence (AI). To improve communication efficiency, the encoder learns the semantic representations (SRs) of multi-modal data, while the generator learns how to manipulate them for locally rendering scenes and interactions in the metaverse. Since these learned SMs are biased towards local environments, their success hinges on synchronizing heterogeneous SMs in the background while communicating SRs in the foreground, turning the wireless metaverse problem into the problem of semantic multiverse communication (SMC). Based on this SMC architecture, we propose several promising algorithmic and analytic tools for modeling and designing SMC, ranging from distributed learning and multi-agent reinforcement learning (MARL) to signaling games and symbolic AI.
translated by 谷歌翻译
我们挑战AI模型,以“展示”对《纽约客》标题比赛的复杂多模式幽默的理解。具体而言,我们开发了三个精心限制的任务,以掌握图像和标题之间的潜在复杂和意外的关系,并且对人类经验的广泛品种产生了复杂和意外的寓意;这些是纽约口径卡通的标志。我们调查了直接将卡通像素和字幕输入的视觉和语言模型,以及仅通过提供图像的文本描述来规避图像处理的仅限语言模型。即使我们为卡通图像提供了丰富的多方面注释,我们也可以确定高质量的机器学习模型(例如,微调,175b参数语言模型)和人类之间的性能差距。我们公开发布我们的语料库,包括描述图像的位置/实体的注释,场景的不寻常以及对笑话的解释。
translated by 谷歌翻译
ICECUBE是一种用于检测1 GEV和1 PEV之间大气和天体中微子的光学传感器的立方公斤阵列,该阵列已部署1.45 km至2.45 km的南极的冰盖表面以下1.45 km至2.45 km。来自ICE探测器的事件的分类和重建在ICeCube数据分析中起着核心作用。重建和分类事件是一个挑战,这是由于探测器的几何形状,不均匀的散射和冰中光的吸收,并且低于100 GEV的光,每个事件产生的信号光子数量相对较少。为了应对这一挑战,可以将ICECUBE事件表示为点云图形,并将图形神经网络(GNN)作为分类和重建方法。 GNN能够将中微子事件与宇宙射线背景区分开,对不同的中微子事件类型进行分类,并重建沉积的能量,方向和相互作用顶点。基于仿真,我们提供了1-100 GEV能量范围的比较与当前ICECUBE分析中使用的当前最新最大似然技术,包括已知系统不确定性的影响。对于中微子事件分类,与当前的IceCube方法相比,GNN以固定的假阳性速率(FPR)提高了信号效率的18%。另外,GNN在固定信号效率下将FPR的降低超过8(低于半百分比)。对于能源,方向和相互作用顶点的重建,与当前最大似然技术相比,分辨率平均提高了13%-20%。当在GPU上运行时,GNN能够以几乎是2.7 kHz的中位数ICECUBE触发速率的速率处理ICECUBE事件,这打开了在在线搜索瞬态事件中使用低能量中微子的可能性。
translated by 谷歌翻译